Abstract

In (4), Chaudhuri, Chen, Mihaescu and Rao study algorithmic properties of the
tandem duplication - random loss model of genome rearrangement, well-known in
evolutionary biology. In their model, the cost of one step of duplication-loss of width
$k$ is $\alpha^k$ for $\alpha = 1$ or $\alpha \geq 2$. In this paper, we study a variant of this model, where
the cost of one step of width $k$ is 1 if $k \leq K$ and $\infty$ if $k > K$, for any value of the
parameter $K \in \mathbb{N} \cup \{\infty\}$. We first show that permutations obtained after $p$ steps
of width $K$ define classes of pattern-avoiding permutations. We also compute the
numbers of duplication-loss steps of width $K$ necessary and sufficient to obtain any
permutation of $S_n$, in the worst case and on average. In this second part, we may
also consider the case $K = K(n)$, a function of the size $n$ of the permutation on
which the duplication-loss operations are performed.


Key words: Sorting, Permutations, Pattern

PACS: 


1 Introduction 



1.1 The model 



In the usual models of genome rearrangement, duplications and losses of genes 
are not taken into account. There were attempts to incorporate them into the
classical models, but the resulting combinatorial complexity of the models
so obtained made their study quite difficult. Following (4), we focus on
the duplication-loss problem by considering the tandem duplication - random 
loss model of genome rearrangement in which genomes are modified only by 
duplications and losses of genes. 

One step of tandem duplication - random loss, or duplication-loss for short,
consists in (1) the tandem duplication of a contiguous fragment of the genome,
i.e., the duplicated fragment is inserted immediately after the original fragment,
and (2) the loss of one of the two copies of every duplicated gene. We
assume that the loss occurs immediately after the duplication of genes, which 
is, on an evolutionary time-scale, a good approximation to reality. The width 
of a step is the number of duplicated genes. See Figure 1 for an example. 



1 2 3 4 5 6 7 → 1 2 3 4 5 6 3 4 5 6 7
(tandem duplication)

→ 1 2 [3] 4 5 [6] 3 [4] [5] 6 7
(random loss; the bracketed copies are lost)

→ 1 2 4 5 3 6 7



Fig. 1. Example of one step of tandem duplication - random loss of width 4 
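
As an illustration, one duplication-loss step can be sketched in a few lines of Python. This is only a minimal sketch: the function name `duplication_loss_step` and its parameters (`start`, `width`, `keep_first`) are hypothetical names chosen here, not notation from the paper. The call below reproduces the step of Figure 1.

```python
def duplication_loss_step(perm, start, width, keep_first):
    """Apply one tandem duplication - random loss step to `perm`.

    perm       -- list of distinct genes
    start      -- 0-based index of the first duplicated gene
    width      -- number of duplicated genes (the width of the step)
    keep_first -- genes of the window whose FIRST copy survives;
                  every other gene of the window keeps its second copy
    """
    window = perm[start:start + width]
    # Tandem duplication: the copy is inserted right after the original fragment.
    duplicated = perm[:start] + window + window + perm[start + width:]
    # Random loss: exactly one of the two copies of each duplicated gene is lost.
    first_copy = [g for g in window if g in keep_first]
    second_copy = [g for g in window if g not in keep_first]
    result = perm[:start] + first_copy + second_copy + perm[start + width:]
    return duplicated, result


# Step of width 4 on the fragment 3 4 5 6, as in Figure 1.
duplicated, result = duplication_loss_step([1, 2, 3, 4, 5, 6, 7], 2, 4, {4, 5})
print(duplicated)
print(result)
```

Both surviving parts of the window are subsequences of the original fragment, which is why a single step can create at most one descent inside the window.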

From a formal point of view, a genome consisting of $n$ genes is modelled by
a permutation $\pi \in S_n$ of the set of integers $\{1, 2, \dots, n\}$. In (4), the authors
define the cost of a duplication-loss step of width $k$ to be $\alpha^k$, $\alpha \geq 1$ being a
constant parameter. They suggest that other cost functions can be considered,
and in particular affine functions. In this paper, we consider a piecewise constant
cost function: the cost of a step of width $k$ is 1 if $k \leq K$ and is infinite
for $k > K$, for some fixed parameter $K \in \mathbb{N} \cup \{\infty\}$. Obviously, for this model
to be meaningful, we assume that $K \geq 2$. We also consider the possibility
that $K = K(n)$ is dependent on the size $n$ of the permutation on which the
duplication-loss operations are performed. Both models are generalizations of
the whole genome duplication - random loss model: it corresponds to the case
$\alpha = 1$ in the model of (4), $K = \infty$ or $K = K(n) = n$ in our model.

Many models of evolution of permutations are inspired by computational bi- 
ology issues: see (2), (5), (6), (7) for examples in the literature. 

Our model of evolution of permutations can be viewed in the framework of 
permuting machines defined in (1). Such a machine takes a permutation as
input and transforms it into an output permutation, the transformation being
required to satisfy two properties: independence with respect to the values
and stability with respect to pattern-involvement (see (1) for more details).
The important point is that the duplication-loss transformation satisfies these
two properties. Thus, one duplication-loss step (in one of the models defined
above) corresponds to running an adequate permuting machine once. When we
consider permutations obtained after a sequence of duplication-loss steps,
they correspond to the permutations output by a combination in
series of identical permuting machines.

For ease of exposition in some proofs, we will sometimes use a graphical
representation of permutations, as shown in Figure 2.

Fig. 2. The graphical representation of $\sigma = 68135427$



1.2 Pattern-avoiding classes of permutations



Although it is not yet apparent, there are strong links between
the duplication-loss model and some pattern-avoiding classes of permutations.
Hence, we need to recall a few definitions concerning those classes.

A permutation $\sigma \in S_n$ is a bijective map from $[1..n]$ to itself. The integer $n$
is called the size of $\sigma$, denoted $|\sigma|$. We denote by $\sigma_i$ the image of $i$ under $\sigma$.
A permutation can be seen as a word $\sigma_1 \sigma_2 \dots \sigma_n$ containing exactly once each
letter $i \in [1..n]$. For each entry $\sigma_i$ of a permutation $\sigma$, we call $i$ its position
and $\sigma_i$ its value.

Definition 1 A permutation $\pi \in S_k$ is a pattern of a permutation $\sigma \in S_n$
if there is a subsequence of $\sigma$ which is order-isomorphic to $\pi$; in other words,
if there is a subsequence $\sigma_{i_1} \sigma_{i_2} \dots \sigma_{i_k}$ of $\sigma$ (with $1 \leq i_1 < i_2 < \dots < i_k \leq n$)
such that $\sigma_{i_\ell} < \sigma_{i_m}$ whenever $\pi_\ell < \pi_m$.

We also say that $\pi$ is involved in $\sigma$ and call $\sigma_{i_1} \sigma_{i_2} \dots \sigma_{i_k}$ an occurrence of $\pi$
in $\sigma$.

We write $\pi \prec \sigma$ to denote that $\pi$ is a pattern of $\sigma$.

A permutation $\sigma$ that does not contain $\pi$ as a pattern is said to avoid $\pi$.
The class of all permutations avoiding the patterns $\pi_1, \pi_2, \dots, \pi_k$ is denoted
$S(\pi_1, \pi_2, \dots, \pi_k)$, and $S_n(\pi_1, \pi_2, \dots, \pi_k)$ denotes the set of permutations of size
$n$ avoiding $\pi_1, \pi_2, \dots, \pi_k$. We say that $S(\pi_1, \pi_2, \dots, \pi_k)$ is a class of
pattern-avoiding permutations of basis $\{\pi_1, \pi_2, \dots, \pi_k\}$.

Example 2 For example $\sigma = 142563$ contains the pattern 1342, and 1563,
1463, 2563 and 1453 are the occurrences of this pattern in $\sigma$. But $\sigma \in S_6(321)$:
$\sigma$ avoids the pattern 321 as no subsequence of size 3 of $\sigma$ is isomorphic to 321,
i.e., is decreasing.
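
Definition 1 translates directly into a brute-force search over subsequences. The sketch below is an illustration only (the helper name `occurrences` is an assumption made here, and the exhaustive search is exponential in the size of the pattern); it recovers the four occurrences listed in Example 2.

```python
from itertools import combinations


def occurrences(sigma, pi):
    """Return all occurrences of the pattern pi in sigma, as tuples of values.

    A subsequence is an occurrence when it is order-isomorphic to pi, i.e.
    its entries compare pairwise exactly like the entries of pi.
    """
    k = len(pi)
    found = []
    # combinations() keeps the entries in their order of appearance in sigma.
    for sub in combinations(sigma, k):
        if all((sub[a] < sub[b]) == (pi[a] < pi[b])
               for a in range(k) for b in range(a + 1, k)):
            found.append(sub)
    return found


sigma = [1, 4, 2, 5, 6, 3]
print(occurrences(sigma, [1, 3, 4, 2]))   # the occurrences of 1342
print(occurrences(sigma, [3, 2, 1]))      # empty: sigma avoids 321
```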
1.3 Outline of the paper



In the tandem duplication - random loss model described above, we will focus 
on two kinds of problems. First, as hinted before, we will consider permu- 
tations obtained after a certain number of duplication-loss steps, that is to 
say permutations in output of a combination in series of a certain number of 
permuting machines. For this, we define the class C(K,p) as follows: 

Definition 3 The class C(K,p) denotes the class of all permutations obtained 
from 12 ... n (for any n) after p duplication-loss steps of width at most K , for 
some constant parameters p and K . 

We do not consider the case $K = K(n)$ here.

Note that the duplication-loss steps are not reversible, as noticed in (4),
and that consequently C(K,p) is not the class of permutations that can be
sorted to 12 ... n in p steps of duplication-loss of width at most K.

As for the various classes of permutations obtained after a combination in
series of permuting machines considered in (1), we obtain combinatorial
properties of C(K,p) in terms of pattern-avoidance. Namely, we show that
C(K,p) is a class of pattern-avoiding permutations. In the case $p = 1$ (Section
2.2), we give a precise description of the basis $B$ of excluded patterns: $B =
\{321, 3142, 2143\} \cup D$, $D$ being the set of all permutations of $S_{K+1}$ that do
not start with 1 nor end with $K + 1$, and containing exactly one descent. In
particular, $B$ is of cardinality $3 + 2^{K-1}$ and contains patterns of size at most
$K + 1$. For the general case (Section 2.3), we cannot get such a precise result
but only a bound on the size of the excluded patterns: we show that C(K,p)
is a class of pattern-avoiding permutations whose basis contains patterns of
size at most $(Kp + 2)^2 - 2$.

A second point of view is to examine how many steps of a given width are
necessary to obtain any permutation of $S_n$ starting from $12 \dots n$. Namely in
Section 3 we fix a width $K$ (constant, or $K = K(n)$) and a size $n$ and search for
the number $p$ such that any permutation of $S_n$ can be obtained from $12 \dots n$ in
at most $p$ duplication-loss steps of width at most $K$. We describe an algorithm
computing a possible scenario of duplications and losses for any $\pi \in S_n$, this
scenario involving $\Theta(\frac{n}{K} \log K + \frac{n^2}{K^2})$ duplication-loss steps in the worst case
and on average. We also show that $\Omega(\log n + \frac{n^2}{K^2})$ steps are necessary (in the
worst case and on average) to obtain any permutation of $S_n$ from $12 \dots n$.
These upper and lower bounds coincide in most cases.
2 Characterization with excluded patterns

Before focusing on the classes C(K, 1) and C(K,p) defined for our model, we
will get back to the simpler whole genome duplication - random loss model
(corresponding to $K = \infty$ in our model, but defined previously by other
authors). We will not prove new theorems, but will interpret the existing
results from the pattern-avoidance point of view.

2.1 The whole genome duplication - random loss model through the pattern-avoidance prism

Let us recall that in the whole genome duplication - random loss model, any
duplication-loss step has cost 1, so that we can consider w.l.o.g. that the
duplicated fragment is the whole permutation at any step. The cost of obtaining
a permutation $\sigma \in S_n$ from the identity is just the minimal number of steps
of a duplication-loss scenario transforming $12 \dots n$ into $\sigma$.

A statistic of permutations that matters for our purpose is their number of
descents.

Definition 4 Given a permutation $\sigma$ of size $n$, we say that there is a descent
(resp. ascent) at position $i$, $1 \leq i \leq n - 1$, if $\sigma_i > \sigma_{i+1}$ (resp. $\sigma_i < \sigma_{i+1}$). We
write $desc(\sigma)$ for the number of descents of the permutation $\sigma$.

Example 5 For example, $\sigma = 524316$ has 3 descents, namely at positions 1,
3 and 4.

A permutation $\sigma$ of size $n$ has at most $n - 1$ descents, the case of exactly $n - 1$
descents corresponding to the reversed identity permutation $n(n - 1) \dots 21$. It
is also well known that the average number of descents among
permutations of size $n$ is

$$\frac{n-1}{2}. \qquad (1)$$

In (4), the authors prove the following theorem. 

Theorem 6 Let $\sigma \in S_n$. In the whole genome duplication - random loss
model, $\lceil \log_2(desc(\sigma) + 1) \rceil$ steps are necessary and sufficient to obtain $\sigma$ from
$12 \dots n$.

It is equivalent to say that the permutations that can be obtained in at most $p$
steps in the whole genome duplication - random loss model are exactly those
whose number of descents is at most $2^p - 1$.
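
The quantity in Theorem 6 is easy to evaluate. The following sketch (function names `desc` and `whole_genome_steps` are chosen here for illustration) counts descents as in Definition 4 and computes $\lceil \log_2(desc(\sigma) + 1) \rceil$ with integer arithmetic only, using the fact that for an integer $d \geq 0$ this ceiling equals the number of bits of $d$.

```python
def desc(sigma):
    """Number of descents of sigma: positions i with sigma_i > sigma_{i+1}."""
    return sum(1 for a, b in zip(sigma, sigma[1:]) if a > b)


def whole_genome_steps(sigma):
    """Value of ceil(log2(desc(sigma) + 1)), the step count of Theorem 6.

    For an integer d >= 0, ceil(log2(d + 1)) is the bit length of d,
    which avoids floating-point rounding issues.
    """
    return desc(sigma).bit_length()


sigma = [5, 2, 4, 3, 1, 6]           # the permutation of Example 5
print(desc(sigma))                   # number of descents
print(whole_genome_steps(sigma))     # steps needed in the whole genome model
```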

Now, we can notice that the property of being obtainable in at most $p$ steps is
stable for the pattern-involvement relation $\prec$: if $\sigma$ can be obtained in at most
$p$ steps, and if $\pi \prec \sigma$, then $\pi$ can also be obtained in at most $p$ steps. Indeed,
it is enough to perform the same duplication-loss scenario on $\sigma$, keeping track
only of the elements of $\sigma$ that form an occurrence of $\pi$. This stability for $\prec$
implies that the class of permutations obtainable in at most $p$ steps is a class
of pattern-avoiding permutations, whose excluded patterns are the minimal
(again in the sense of $\prec$) permutations that cannot be obtained in $p$ steps.

Then, by Theorem 6, the excluded patterns are the minimal permutations
with $2^p$ descents. We initiated a study of the minimal permutations with $d$
descents in (3). However, it is simple to notice that a permutation with $d$
descents and minimal for this criterion has size at most $2d$, since it does not
contain two consecutive ascents by minimality. An immediate consequence is
that the number of excluded patterns is finite.

This allows us to state the following version of Theorem 6: 

Theorem 7 The permutations that can be obtained in at most $p$ steps in the
whole genome duplication - random loss model form a class of pattern-avoiding
permutations. The excluded patterns are the permutations with exactly $2^p$
descents that are minimal (in the sense of $\prec$) for this criterion. There are
finitely many such excluded patterns.

In (3), we will give a simpler description and some properties of these minimal 
permutations with d descents. 

2.2 Permutations obtained in one step of width K 

As an introduction to the study of C(K, p), we deal in this section with the
simpler case of the class $C(K) = C(K, 1)$ of permutations obtained from $12 \dots n$
in one duplication-loss step of width at most $K$. Assume in this section that
the parameter $K \geq 2$ is fixed. Throughout this section, when referring to a
duplication-loss step, we always mean duplication-loss step of width $K$, except
when otherwise explicitly stated.

It is easily noticed that no permutation of C(K) can have more than one
descent. Conversely, any permutation of size at most K having exactly one
descent belongs to C(K).

Although it is a technical point of importance in the proof of Theorem 9, the
following proposition is straightforward:

Proposition 8 The permutations of size $K + 1$ that do not belong to C(K)
and having exactly one descent are exactly those of $S_{K+1}$ with one descent that
do not start with 1 nor end with $K + 1$.

PROOF. Let $\sigma = \sigma_1 \sigma_2 \dots \sigma_{K+1}$ be a permutation of size $K + 1$ that does
not belong to C(K) but has exactly one descent. Now, if $\sigma_1 = 1$, then $\tilde\sigma =
\sigma_2 \dots \sigma_{K+1}$ is a permutation (of $\{2, 3, \dots, K+1\}$) of size $K$ having one descent,
and therefore $\tilde\sigma$ can be obtained from $23 \dots K + 1$ in one duplication-loss
step. Applying the same transformation to $123 \dots K + 1$ will then produce $\sigma$,
contradicting that $\sigma \notin C(K)$. The same reasoning holds when $\sigma_{K+1} = K+1$.
So $\sigma$ does not start with 1 nor end with $K+1$.

Now if $\sigma$ is a permutation of size $K + 1$ having exactly one descent, that does
not start with 1 nor end with $K + 1$, we claim that $\sigma$ cannot be obtained from
$12 \dots K + 1$ in one duplication-loss step. This is because no duplication-loss
step of width $K$ can move both 1 and $K+1$ in $12 \dots K + 1$.

Theorem 9 The class C(K) of permutations obtained from $12 \dots n$ (for some
$n \geq 1$) in one duplication-loss step of width $K$ is a class $S(B)$ of
pattern-avoiding permutations whose basis $B$ is composed of $3 + 2^{K-1}$ patterns of
size at most $K + 1$. Namely $B = \{321, 3142, 2143\} \cup D$, $D$ being the set of
all permutations of $S_{K+1}$ that do not start with 1 nor end with $K + 1$, and
containing exactly one descent.

Example 10 C(4) =
S(321, 3142, 2143, 23451, 23514, 24513, 34512, 25134, 35124, 45123, 51234)
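
The set $D$ of Theorem 9 can be enumerated by brute force for small $K$. The sketch below is illustrative only (the names `excluded_patterns_D` and `basis` are assumptions made here); for $K = 4$ it recovers the eight patterns of size 5 listed in Example 10, and it lets one check the count $|D| = 2^{K-1}$ on small cases.

```python
from itertools import permutations


def descents(sigma):
    """Number of positions i with sigma_i > sigma_{i+1}."""
    return sum(1 for a, b in zip(sigma, sigma[1:]) if a > b)


def excluded_patterns_D(K):
    """Permutations of size K+1 with exactly one descent that neither
    start with 1 nor end with K+1 (the set D of Theorem 9)."""
    n = K + 1
    return [p for p in permutations(range(1, n + 1))
            if descents(p) == 1 and p[0] != 1 and p[-1] != n]


def basis(K):
    """Basis B = {321, 3142, 2143} together with D."""
    return [(3, 2, 1), (3, 1, 4, 2), (2, 1, 4, 3)] + excluded_patterns_D(K)


for pattern in basis(4):
    print("".join(map(str, pattern)))
```

Enumerating all $(K+1)!$ permutations is only practical for small $K$; it is meant as a check of the statement, not as an efficient construction.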

PROOF. We prove the contrapositive statement: $\sigma \notin S(B)$ if and only if $\sigma$ cannot
be obtained from an identity permutation in one duplication-loss step of width
$K$.

Assume $\sigma \notin S(B)$. Then there exists $b \in B$ such that $b \prec \sigma$. If $b = 321$,
3142 or 2143, then $\sigma$ has at least 2 descents and cannot be obtained in one
duplication-loss step. Otherwise, using Proposition 8, there exists $\rho \in S_{K+1}$
such that $\rho \prec \sigma$ and $\rho \notin C(K)$. Now if $\sigma$ could be obtained in one
duplication-loss step, then so would be $\rho$, yielding a contradiction. So $\sigma \notin C(K)$.

Conversely, assume that $\sigma \notin C(K)$. If $\sigma$ contains at least 2 descents, then
$\sigma$ contains an occurrence of 321 or 3142 or 2143, since these three are the
minimal permutations (in the sense of the relation $\prec$) with 2 descents. And
consequently, $\sigma \notin S(B)$. Thus we may assume that $\sigma$ has exactly one descent.
We decompose $\sigma \in S_n$ into $\sigma = 12 \dots p_1 \, \tilde\sigma \, p_2 (p_2 + 1) \dots n$, where $\tilde\sigma$ is a
permutation of the set $\{p_1 + 1, p_1 + 2, \dots, p_2 - 1\}$ that does not start with $p_1 + 1$
nor end with $p_2 - 1$, and contains exactly one descent. This decomposition is
shown in Figure 3. We denote by $\tilde K$ the size of $\tilde\sigma$. Since $\sigma \notin C(K)$, necessarily
$\tilde K \geq K + 1$ or we would get a contradiction. If $\tilde K = K + 1$, we get that $\tilde\sigma$ is
an occurrence of some pattern of $D \subseteq B$ in $\sigma$. As a consequence, $\sigma \notin S(B)$.
What is left to prove is that this extends to the case $\tilde K > K + 1$. We just
need to show that we can remove elements in $\tilde\sigma$ without violating any of the
properties below:

• the permutation does not start with its smallest element

• the permutation does not end with its greatest element

• the permutation has exactly one descent

until we get a permutation of size $K+1$. At that point $\tilde\sigma$ contains an occurrence
of a pattern in $D$, and so does $\sigma$, and we get that $\sigma \notin S(B)$. Now, because of
the conditions on $\tilde\sigma$, the only descent in $\tilde\sigma$ necessarily goes from the greatest
to the smallest element in $\tilde\sigma$, ensuring that it is possible to remove elements
without violating any of the properties above (see Figure 3).

Fig. 3. Decomposition $\sigma = 12 \dots p_1 \, \tilde\sigma \, p_2 (p_2 + 1) \dots n$ on the graphical representation
of $\sigma$ (left panel: decomposition of $\sigma$), and shape of $\tilde\sigma$ (right panel)
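
Theorem 9 can be checked experimentally on small sizes. The sketch below is a brute-force sanity check written for this purpose, not a procedure from the paper; all function names are assumptions made here. It generates every permutation reachable from the identity in one step of width at most $K$, and compares this set with the permutations of the same size avoiding the basis $B$.

```python
from itertools import combinations, permutations


def one_step_class(n, K):
    """Permutations of size n obtained from 12...n in one step of width <= K."""
    identity = tuple(range(1, n + 1))
    reached = {identity}
    for width in range(2, min(K, n) + 1):
        for start in range(n - width + 1):
            window = identity[start:start + width]
            for r in range(width + 1):
                for kept in combinations(window, r):
                    first = tuple(g for g in window if g in kept)
                    second = tuple(g for g in window if g not in kept)
                    reached.add(identity[:start] + first + second
                                + identity[start + width:])
    return reached


def contains(sigma, pi):
    """True if pi is a pattern of sigma (exhaustive search)."""
    k = len(pi)
    return any(all((sub[a] < sub[b]) == (pi[a] < pi[b])
                   for a in range(k) for b in range(a + 1, k))
               for sub in combinations(sigma, k))


def basis(K):
    """Basis of Theorem 9: 321, 3142, 2143 and the set D."""
    def desc(p):
        return sum(1 for a, b in zip(p, p[1:]) if a > b)
    D = [p for p in permutations(range(1, K + 2))
         if desc(p) == 1 and p[0] != 1 and p[-1] != K + 1]
    return [(3, 2, 1), (3, 1, 4, 2), (2, 1, 4, 3)] + D


def avoiders(n, K):
    """Permutations of size n avoiding every pattern of the basis."""
    B = basis(K)
    return {s for s in permutations(range(1, n + 1))
            if not any(contains(s, b) for b in B)}


for K in (2, 3, 4):
    for n in range(1, 7):
        print(K, n, one_step_class(n, K) == avoiders(n, K))
```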

2.3 Permutations obtained in p steps of width K 

As for the case of C(K, 1) in Section 2.2, we prove (Theorem 19) in this section
that the class C(K,p) of all permutations obtained from an identity permutation
after p duplication-loss steps of width at most K is a class of
pattern-avoiding permutations. However, we do not get a precise description of the
basis of this class, but only an upper bound on the size of the excluded patterns.
As in the previous section, when referring to a duplication-loss step,
we always mean duplication-loss step of width K, except when otherwise
explicitly stated.

To prove the announced result, we will need a few more notations and technical 
lemmas. 

The vector from $i$ to $j$ in a permutation $\sigma$ consists of all elements whose
positions lie between the positions of $i$ and $j$, $i$ and $j$ being included. The size
of a vector is the number of elements in it. For example, the vector from 7 to
2 in the permutation 4123576 is 2357, and has size 4.

Definition 11 Let $\sigma$ be a permutation of $S_n$. The value-position vector
associated with $i \in [1..n]$ (vp-vector for short) is the vector of $\sigma$ going from $i$ to
$\sigma_i$, if $i$ is not a fixpoint of $\sigma$. In the case $i = \sigma_i$, the vp-vector associated with
$i$ is empty.

This definition means that the vp-vector associated with $i$, going
from the element of $\sigma$ which has value $i$ to the element of $\sigma$ at position $i$,
represents the necessary move for $i$ to reach its position in the sorted
permutation $12 \dots n$. As can be seen in Figure 4, on the graphical representation
of permutations used throughout the paper, the vp-vector associated with $i$ is
an arrow going horizontally from the element at ordinate $i$ to the diagonal.

We can also notice that a non-empty vp-vector contains at least two elements.

To take into account all the moves necessary to sort $\sigma$ to $12 \dots n$, it is
convenient to introduce the value-position domain:

Definition 12 Let $\sigma$ be a permutation of $S_n$. The value-position domain of
$\sigma$ (vp-domain for short) is composed of all elements of $\sigma$ appearing in at least
one vp-vector.

These two definitions are illustrated in Figure 4.



Fig. 4. vp-vectors and vp-domain for $\sigma = 4123576$, in the usual and in the graphical
representations. Here the vp-domain of $\sigma$ is $\{1, 2, 3, 4, 6, 7\}$.
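
Definitions 11 and 12 can be made concrete with a short sketch. The function names `vector`, `vp_vector` and `vp_domain` are chosen here for illustration; positions are 1-based in the text and converted to 0-based list indices in the code. The example reproduces the vector from 7 to 2 and the vp-domain given for $\sigma = 4123576$.

```python
def vector(sigma, a, b):
    """Elements of sigma whose positions lie between the positions of the
    values a and b, both included."""
    pa, pb = sigma.index(a), sigma.index(b)
    lo, hi = min(pa, pb), max(pa, pb)
    return sigma[lo:hi + 1]


def vp_vector(sigma, i):
    """vp-vector associated with i: the vector going from the value i to
    sigma_i (the element at position i); empty when i is a fixpoint."""
    target = sigma[i - 1]          # sigma_i, with 1-based position i
    return [] if target == i else vector(sigma, i, target)


def vp_domain(sigma):
    """All elements of sigma appearing in at least one vp-vector."""
    return {x for i in range(1, len(sigma) + 1) for x in vp_vector(sigma, i)}


sigma = [4, 1, 2, 3, 5, 7, 6]
print(vector(sigma, 7, 2))     # vector from 7 to 2
print(vp_vector(sigma, 4))     # vp-vector of 4
print(sorted(vp_domain(sigma)))
```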

Now, observe that for any permutation, the vp-vectors are reversible in the
sense that reversing all the arrows will give a set of vectors that represent the
moves of elements that are necessary to "unsort" $12 \dots n$ into $\sigma$. It is easily
seen from Definitions 11 and 12 and this remark that for any permutation $\sigma \in
C(K,p)$, any element belonging to the vp-domain of $\sigma$ also belongs to at least
one of the duplication-loss steps used to obtain $\sigma$ from $12 \dots n$. Consequently,
the vp-domain of $\sigma$ contains at most $Kp$ elements.

Lemma 13 Consider a permutation $\sigma$, and the permutation $\tau$ obtained from
$\sigma$ by the removal of some element $j$. Then for any element $i \neq j$ such that
$i \neq \sigma_i$, either this element becomes a fixpoint in $\tau$ or the size of the vp-vector
associated with this element in $\tau$ remains constant, is increased by 1 or is
decreased by 1 with respect to the size of the vp-vector associated with $i$ in $\sigma$.

PROOF. It is easily seen on the graphical representation of $\sigma$. Any element
that does not lie just above or just below the diagonal cannot become a
fixpoint when removing an element $j$. For elements that do not become fixpoints,
the horizontal distance to the diagonal can only change by 0, 1 or $-1$ when
removing some element $j$ (see Figure 5).



Fig. 5. Variation of the size of vp-vectors due to the removal of an element j above
or below the diagonal.

Lemma 14 For any permutation $\sigma$, there is at least one element $j$ such that
the permutation $\tau$ obtained from $\sigma$ by the removal of $j$ contains at most one
more fixpoint than $\sigma$.




PROOF. It is convenient to introduce the quasi-diagonal elements of $\sigma$,
defined as follows: $i$ is a quasi-diagonal element of $\sigma$ if $\sigma_{i-1} = i$ or $\sigma_{i+1} = i$.
These two cases correspond respectively to elements of $\sigma$ lying just above or
just below the diagonal in the graphical representation of $\sigma$. Any element of
$\sigma$ that may become a fixpoint in $\tau$ is necessarily a quasi-diagonal element.

If there is no quasi-diagonal element, then we can remove any element $j$ to
obtain a permutation $\tau$ that does not have more fixpoints than $\sigma$. If there are
some, then we pick $j$ among the quasi-diagonal elements. We claim that at
most one fixpoint is created while removing $j$. The argument is simple. Suppose
$j$ is such that $\sigma_{j-1} = j$, the other case being similar. Then the only fixpoint
that may appear is $j - 1$, if $\sigma_j = j - 1$. This is illustrated in Figure 6.




Fig. 6. The only fixpoint that can appear when removing a quasi-diagonal element.

Lemma 15 Consider a permutation $\sigma \notin C(K,p)$ such that for any strict
pattern $\tau$ of $\sigma$, $\tau \in C(K,p)$. Then the vp-domain of $\sigma$ is of size at most
$2Kp + 2$.

PROOF. By Lemma 14, we can choose some $\tau \prec \sigma$ with $|\tau| + 1 = |\sigma|$ and
such that $\tau$ has at most one more fixpoint than $\sigma$. Call $j$ the element deleted
in $\sigma$ to obtain $\tau$. By a previous remark, since $\tau \in C(K,p)$, the vp-domain of
$\tau$ is of size at most $Kp$, and is therefore composed of at most $Kp$ vp-vectors.
Each of these vp-vectors in $\tau$ yields a vp-vector in $\sigma$, whose size is smaller or
equal or possibly increased by 1. Let us denote by $V$ the set of vp-vectors of $\sigma$
obtained from a vp-vector of $\tau$. Then the number of elements of $\sigma$ that belong
to a vp-vector of $V$ is at most $2Kp$. However $V$ is not yet the vp-domain of $\sigma$.
We must complete it with up to two vp-vectors: the one associated with the
element $j$ deleted, and the one associated with the fixpoint of $\tau$ that was not a
fixpoint in $\sigma$, if such a point exists. If such an element exists, then it is a
quasi-diagonal element in $\sigma$ and its vp-vector (denoted $\vec{v}$) in $\sigma$ is necessarily of size
2, so that $V \cup \{\vec{v}\}$ has total size at most $2Kp + 2$. Now it is easily observed
that any element of $\sigma$ belonging to one vp-vector necessarily belongs to at least
two vp-vectors (this can be seen as a "balance condition"). Consequently, all
the elements of the vp-vector associated with $j$ are already covered by a vector
of $V \cup \{\vec{v}\}$, so that the vp-domain of $\sigma$ is exactly the set of elements covered
by $V \cup \{\vec{v}\}$. Therefore, its size is at most $2Kp + 2$.

Lemma 16 Consider a permutation $\sigma \notin C(K,p)$ of size $n > (Kp + 2)^2 - 2$
such that for any strict pattern $\tau$ of $\sigma$, $\tau \in C(K,p)$. Then $\sigma$ is of the form
$\sigma = I \, i(i+1) \dots (i+Kp) \, J$ with $I$ a permutation of $[1..i - 1]$ and $J$ a permutation
of $[i + Kp + 1..n]$. It is possible that $I$ or $J$ is empty.

PROOF. By Lemma 15, the vp-domain of $\sigma$ is of size at most $2Kp + 2$. We
can decompose $\sigma$ into free windows of consecutive elements outside the
vp-domain of $\sigma$, separated by windows of consecutive elements of the vp-domain.
Now, there are at most $Kp + 1$ windows of consecutive elements of the
vp-domain, and consequently, there are at most $Kp + 2$ free windows in $\sigma$. Since
$\sigma$ is of size $n > (Kp + 2)^2 - 2 = (Kp + 2)Kp + 2Kp + 2$, at least one of the free
windows of $\sigma$ has size strictly greater than $Kp$, i.e., contains at least $Kp + 1$
elements. By definition, these elements do not belong to the vp-domain of $\sigma$,
and hence they allow the decomposition of $\sigma$ into $\sigma = I \, i(i + 1) \dots (i + Kp) \, J$
with $I$ a permutation of $[1..i - 1]$ and $J$ a permutation of $[i + Kp + 1..n]$.
Figure 7 represents the decomposition of $\sigma$ used in this proof.
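
The pigeonhole count used in this proof can be spelled out as follows. This is only a restatement of the argument above; the symbol $f$, denoting the number of elements of $\sigma$ lying in free windows, is introduced here for convenience and is not notation from the paper.

```latex
\[
  (Kp+2)^2 - 2 \;=\; K^2p^2 + 4Kp + 2 \;=\; (Kp+2)\,Kp + (2Kp+2),
\]
\[
  f \;\geq\; n - (2Kp+2) \;>\; (Kp+2)^2 - 2 - (2Kp+2) \;=\; (Kp+2)\,Kp .
\]
% More than (Kp+2)Kp elements are spread over at most Kp+2 free windows,
% so at least one free window contains more than Kp elements.
```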

Fig. 7. Proof of Lemma 16: $\sigma$ is decomposed into at most $Kp + 1$ vp-windows
and at most $Kp + 2$ free windows.

Lemma 17 Consider a permutation $\sigma = \sigma'(j + 1)(j + 2) \dots n$ where $\sigma'$ is a
permutation of $[1..j]$. If $\sigma$ is obtainable after $p$ duplication steps of size at
most $K$ then $\sigma$ is obtainable after $p$ duplication steps of size at most $K$ such
that the duplicated window for each step does not intersect $[j+1..n]$.



PROOF. The key idea is to consider the first sequence $s_1, s_2, \dots, s_p$ of
duplication-loss steps and create a new sequence $s'_1, s'_2, \dots, s'_p$ such that:

• Each step concerns only elements of $[1..j]$.

• After every step $s'_i$, the elements $1, 2, \dots, j$ are in the same order as after
performing steps $s_1, s_2, \dots, s_i$.

Then the proof is by induction on the number of steps. If there is only one
step then the proof is straightforward. Suppose now that the above statement
is true up to $p - 1$ steps. Then for the last step, we use our hypothesis for $p - 1$
so that we have operations $s'_1, s'_2, \dots, s'_{p-1}$ respecting the above conditions. For
$s'_p$, it suffices to notice that the elements of $[1..j]$ involved in $s_p$ are also in a window
of size $K$ in the permutation obtained after $s'_{p-1}$ and in the same relative order
by our induction hypothesis, which proves the existence of $s'_p$.



Using these lemmas, we state and prove a key proposition: 

Proposition 18 Consider a permutation $\sigma \notin C(K,p)$. Then either $\sigma$ is of
size at most $(Kp + 2)^2 - 2$, or there exists a strict pattern $\tau$ of $\sigma$ that does not
belong to C(K,p).



PROOF. Consider a permutation $\sigma \notin C(K,p)$ such that any strict pattern $\tau$
of $\sigma$ belongs to C(K,p). We want to show that $\sigma$ is of size $n \leq (Kp+2)^2 - 2$. Let
us assume the contrary. By Lemma 16, there exist $i \in [1..n]$, $I$ a permutation
of $[1..i - 1]$ and $J$ a permutation of $[i + Kp + 1..n]$ such that $\sigma = I \, i(i +
1) \dots (i + Kp) \, J$. Let us denote by $\tilde\sigma$ the permutation $\tilde\sigma = I \, i(i + 1) \dots (i + Kp -
1)(J - 1)$, where $(J - 1)$ is the permutation of $[i + Kp..n - 1]$ obtained from
$J$ by subtracting 1 from every element of $J$. $\tilde\sigma$ is a strict pattern of $\sigma$, hence
$\tilde\sigma \in C(K,p)$. Consider a shortest sequence of duplication-loss steps of width
at most $K$ that produces $\tilde\sigma$ from $12 \dots (n - 1)$. This sequence has at most
$p$ steps, each of width at most $K$. It implies that the total distance crossed
by the elements that are duplicated is at most $Kp$. Consequently, it is not
possible to bring an element of $I$ and an element of $J - 1$ in two consecutive
positions. So it is necessary that the duplication-loss steps of the scenario we
consider are internal to $I$ and $J - 1$. We can reproduce these steps in $I$ and $J$
to obtain $\sigma$ from $12 \dots n$ in at most $p$ duplication-loss steps of width at most
$K$, contradicting that $\sigma \notin C(K,p)$.

It is then quite easy to prove Theorem 19: 

Theorem 19 The class C(K,p) of all permutations obtained from an identity
permutation after $p$ duplication-loss steps of width at most $K$ is a class of
pattern-avoiding permutations whose basis is finite and contains only patterns
of size at most $(Kp + 2)^2 - 2$.

PROOF. We set $B = \{\pi : \pi \notin C(K,p) \text{ and } |\pi| \leq (Kp + 2)^2 - 2\}$ and show
that $S(B) = C(K,p)$.

Consider $\sigma \notin C(K,p)$. If $|\sigma| \leq (Kp + 2)^2 - 2$, then $\sigma \in B$ and $\sigma \notin S(B)$.
Otherwise, if $|\sigma| > (Kp + 2)^2 - 2$, then by Proposition 18, there exists a strict
pattern $\tau$ of $\sigma$ that does not belong to C(K,p). Reasoning by induction on the
size of the permutations, we deduce from $\tau \notin C(K,p)$ that $\tau \notin S(B)$. A direct
consequence is that $\sigma \notin S(B)$. This proves that $S(B) \subseteq C(K,p)$.

Conversely, consider $\sigma \in C(K,p)$. Then any pattern $\tau$ of $\sigma$ is also obtainable
from an identity permutation in at most $p$ steps of width at most $K$ (using
the sequence of duplication-loss steps associated with $\sigma$), i.e., $\tau \in C(K,p)$.
Then $\sigma$ does not contain an occurrence of any pattern of $B$, i.e., $\sigma \in S(B)$.
This shows that $C(K,p) \subseteq S(B)$, ending the proof of the theorem.
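
For small parameters, the permutations of a given size in C(K,p) can be generated exhaustively, which gives a way to experiment with the statements of this section. The sketch below is an assumption-laden illustration (the names `steps_from`, `class_C` and `delete_and_standardize` are chosen here, and the search is exponential). Since a step may leave the permutation unchanged, "after $p$ steps" and "after at most $p$ steps" give the same set.

```python
def steps_from(perm, K):
    """All permutations reachable from perm in one step of width <= K."""
    n = len(perm)
    out = {perm}
    for width in range(2, min(K, n) + 1):
        for start in range(n - width + 1):
            window = perm[start:start + width]
            # Bit b of mask set: the first copy of window[b] is kept.
            for mask in range(1 << width):
                first = tuple(g for b, g in enumerate(window)
                              if (mask >> b) & 1)
                second = tuple(g for b, g in enumerate(window)
                               if not (mask >> b) & 1)
                out.add(perm[:start] + first + second + perm[start + width:])
    return out


def class_C(K, p, n):
    """Permutations of size n obtained from 12...n after p steps of width <= K."""
    current = {tuple(range(1, n + 1))}
    for _ in range(p):
        current = set().union(*(steps_from(s, K) for s in current))
    return current


def delete_and_standardize(sigma, pos):
    """Pattern of sigma obtained by deleting the entry at 0-based index pos."""
    removed = sigma[pos]
    return tuple(v - 1 if v > removed else v
                 for k, v in enumerate(sigma) if k != pos)


print(sorted(class_C(2, 1, 3)))
print(len(class_C(4, 1, 4)), len(class_C(4, 2, 4)))
```

With $K = n = 4$ the model coincides with the whole genome model, so the two printed sizes can be compared with Theorem 6 (permutations of size 4 with at most 1 descent, then with at most 3 descents).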



3 Number of steps of width K to obtain any permutation of size n 

The whole genome duplication - random loss model is studied in (4), and the 
authors describe a method to compute an optimal duplication-loss scenario, 
i.e., a scenario of duplications (of the whole genome in this case) and losses 
whose number of steps is minimal. 

Our model with bounded size duplication operations reduces to the whole
genome duplication - random loss case when $K = n$, and thus to a radix-sort
algorithm as shown in (4), and to a bubble-sort when $K = 2$. We therefore give
an algorithm whose complexity matches the two extremal cases and exhibits
some continuity between the two sorting algorithms.

It is worth noticing that any scenario in our model can be viewed as a whole
genome duplication - random loss scenario. Consequently, the number of steps
of an optimal whole genome duplication - random loss scenario is a lower
bound to the number of steps of an optimal scenario in our duplication-loss
model.

It is also easy to see that, when considering permutations of size at most K, our 
model and the whole genome duplication - random loss model coincide. Indeed, 
we will use for our purpose the procedure of (4), which is given in Algorithm 
1. We omit the proof of correctness and optimality of this algorithm. See (4) 
for details. 

Algorithm 1 An optimal whole genome duplication - random loss scenario
from $12 \dots K$ to $\sigma \in S_K$
1: $\pi \leftarrow 12 \dots K$
2: Partition $\sigma$ into maximal increasing substrings, from left to right
3: Each element of $\sigma$ appearing in the $i$-th maximal increasing substring
   gets as a label the binary representation of $i$
4: for $j = 1$ to $\lceil \log_2(desc(\sigma) + 1) \rceil$ do
5:    Perform a duplication-loss step on $\pi$ that keeps in the first copy of $\pi$
      exactly the elements whose label has a 0 in its $j$-th least significant bit
6: end for



In order to examine every bit of the labels given to the elements of $[1..K]$, the
number of steps in the loop on line 4 is $\lceil \log_2(\text{number of maximal increasing
substrings of } \sigma) \rceil = \lceil \log_2(desc(\sigma) + 1) \rceil$. A consequence is that the number
of steps in an optimal whole genome duplication - random loss scenario from
$12 \dots n$ to $\sigma$ is $O(\log n)$ in the worst case and on average (see equation (1) for
the average case).
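
A possible Python rendering of Algorithm 1 is sketched below. It is an illustration under stated assumptions rather than the implementation of (4): the function name `algorithm1` is chosen here, the increasing runs are labelled from 0 (so that $\lceil \log_2(desc(\sigma) + 1) \rceil$ bits are enough to write every label), and the input may be any sequence of distinct integers, the starting point being its sorted version.

```python
def algorithm1(sigma):
    """Return the successive permutations of a whole genome duplication -
    random loss scenario leading from sorted(sigma) to sigma.

    Each step is a stable partition of the current permutation according
    to one bit of the labels (least significant bit first), i.e. a radix sort.
    """
    # Label every element with the index (from 0) of its maximal increasing run.
    labels = {}
    run = 0
    for pos, value in enumerate(sigma):
        if pos > 0 and sigma[pos - 1] > value:
            run += 1
        labels[value] = run
    pi = sorted(sigma)
    scenario = []
    # run == desc(sigma), and run.bit_length() == ceil(log2(desc(sigma) + 1)).
    for j in range(run.bit_length()):
        first_copy = [v for v in pi if not (labels[v] >> j) & 1]   # bit j is 0
        second_copy = [v for v in pi if (labels[v] >> j) & 1]      # bit j is 1
        pi = first_copy + second_copy
        scenario.append(pi)
    return scenario


for step in algorithm1([5, 2, 4, 3, 1, 6]):
    print(step)
```

Because each partition is stable, elements with the same label never change their relative order, so after the last bit the runs appear in label order and each run is increasing.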

Note that the same algorithm can be used to compute an optimal whole
genome duplication - random loss scenario from $i_1 i_2 \dots i_k$, with $k \leq K$ and
$i_1 < i_2 < \dots < i_k$, to any permutation of $\{i_1, i_2, \dots, i_k\}$.

3.1 Upper bound 

In this section, we provide an algorithm that computes, for any permutation
$\sigma \in S_n$ given as input, a possible scenario of duplications and losses to obtain $\sigma$ from
$12 \dots n$. We will restrict ourselves to duplication-loss steps of width at most
K, so that the number of duplication-loss steps corresponds to the cost of the 
scenario in our cost model. We are interested in the number of duplication- 
loss steps of the scenario produced by the algorithm, in the worst case, and on 
average. It provides an upper bound on the number of duplication-loss steps 
that are necessary to obtain a permutation. The algorithm we use is described 
in Algorithm 2. 

Algorithm 2 A duplication-loss scenario from $12 \ldots n$ to $\sigma \in S_n$
1: $\pi \leftarrow 12 \ldots n$
2: for $i = 1$ to $\lceil \frac{n-K}{\lfloor K/2 \rfloor} \rceil$ do
3:   Let $L^i = \{\sigma_j : n - i\lfloor K/2 \rfloor + 1 \leq j \leq n - (i-1)\lfloor K/2 \rfloor\}$
4:   Perform duplication-loss steps on $\pi$ to move from left to right the elements of $L^i$ to the positions $n - i\lfloor K/2 \rfloor + 1$ to $n - (i-1)\lfloor K/2 \rfloor$ of $\pi$, without changing their respective order
5: end for
6: for $i = 1$ to $\lceil \frac{n-K}{\lfloor K/2 \rfloor} \rceil$ do
7:   Perform Algorithm 1 on the window of $\pi$ between the indices $n - i\lfloor K/2 \rfloor + 1$ and $n - (i-1)\lfloor K/2 \rfloor$
8: end for
9: Perform Algorithm 1 on the window of $\pi$ between the indices 1 and $n - \lceil \frac{n-K}{\lfloor K/2 \rfloor} \rceil \lfloor K/2 \rfloor$

A few remarks help to understand Algorithm 2.
The set $L^i$ of values defined at line 3 represents the rightmost $\lfloor K/2 \rfloor$ elements of $\sigma$ not yet examined. The algorithm consists of two different loops, the first one corresponding to lines 2 to 5 of the algorithm and the second one to lines 6 to 8. At the end of the first loop (line 5), $\pi$ is decomposed into windows of width $\lfloor K/2 \rfloor$ (except the leftmost one, which is of width at most $K$), and each of these windows is an increasing sequence containing exactly the same elements as the window of $\sigma$ corresponding to the same indices. In the second loop, we consider these windows from right to left and, since they are of width at most $K$, we can call Algorithm 1 (which implements whole genome duplication - random loss) on each window successively to transform $\pi$ into $\sigma$.
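The window decomposition described above can be made concrete with a short Python sketch. It is only illustrative: the helper names `windows` and `state_after_first_loop` are ours, the windows follow the indices of lines 3, 7 and 9 of Algorithm 2 (so they are counted from the right end and the leftmost window absorbs the remainder), and $K \geq 2$ is assumed. The sketch computes the state of $\pi$ reached at the end of the first loop, not the duplication-loss steps that lead to it.

```python
def windows(n, K):
    """1-based inclusive index ranges of the windows of Algorithm 2, left to right."""
    half = K // 2
    # ceil((n - K) / floor(K/2)), clamped to 0 when n <= K
    loops = max(0, -(-(n - K) // half))
    right = [(n - i * half + 1, n - (i - 1) * half) for i in range(1, loops + 1)]
    leftmost = (1, n - loops * half)  # width at most K
    return [leftmost] + right[::-1]


def state_after_first_loop(sigma, K):
    """Each window holds the same elements as in sigma, in increasing order."""
    pi = []
    for lo, hi in windows(len(sigma), K):
        pi.extend(sorted(sigma[lo - 1:hi]))
    return pi
```

For instance, with $n = 7$ and $K = 4$ the windows are the index ranges 1 to 3, 4 to 5 and 6 to 7, and `state_after_first_loop([3, 1, 4, 2, 7, 5, 6], 4)` sorts each of these three windows independently.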

An example is given with $\sigma = 2\ 10\ 1\ 7\ 6\ 5\ 8\ 9\ 3\ 4$ and $K = 6$. We first cut $\sigma$ in chunks of size $\lfloor K/2 \rfloor = 3$ and obtain $2\ 10\ 1 \mid 7\ 6\ 5 \mid 8\ 9\ 3 \mid 4$. Then the first loop of the algorithm (steps 2 to 5) starts from $1\ 2\ 3\ 4\ 5\ 6\ 7\ 8\ 9\ 10$ and takes the elements, in increasing order, to the chunk they belong to in $\sigma$. This gives $1\ 2\ 10 \mid 5\ 6\ 7 \mid 3\ 8\ 9 \mid 4$. Then the second loop sorts each chunk separately to obtain $\sigma$, using the radix sort of Algorithm 1 introduced in (4).

Notice here that we use in the second loop (except for the leftmost window) only duplication-loss steps of width $\lfloor K/2 \rfloor$. An improvement we considered is to use whole genome duplication - random loss scenarios on windows of width $K$, that are nonetheless increasing sequences. Unfortunately, we were not able to analyse how many duplication-loss steps there are in a scenario produced by such an algorithm.

We now analyse the number of steps of the scenario produced by Algorithm 
2. 

Proposition 20 The number of duplication-loss steps of a scenario produced by Algorithm 2 on a permutation of size $n$ is at most $\Theta(\frac{n}{K}\log K + \frac{n^2}{K^2})$ asymptotically.



PROOF. Suppose we are at iteration $i$ of the first loop. We have to move the $\lfloor K/2 \rfloor$ elements of $L^i$ to their positions (from $n - i\lfloor K/2 \rfloor + 1$ to $n - (i-1)\lfloor K/2 \rfloor$) by duplication-loss steps of width at most $K$. The worst situation is when the elements of $L^i$ are at the beginning of $\pi$. But in this case, we can move the elements of $L^i$ to the right by $\lceil K/2 \rceil$ positions at every duplication-loss step, until they reach their position. The total number of duplication-loss steps in this first process is then at most

$$\sum_{i=1}^{\left\lceil \frac{n-K}{\lfloor K/2 \rfloor} \right\rceil} \left\lceil \frac{n - i\lfloor K/2 \rfloor}{\lceil K/2 \rceil} \right\rceil = \Theta\left(\frac{n^2}{K^2}\right).$$

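Consider now the second loop of Algorithm 2. In each window of size $\lfloor K/2 \rfloor$, it performs at most $\lceil \log \lfloor K/2 \rfloor \rceil$ duplication-loss steps (line 7) and in the leftmost window (line 9), at most $\lceil \log K \rceil$ by the result of (4). Consequently the number of duplication-loss steps produced by the second loop is at most

$$\left\lceil \frac{n-K}{\lfloor K/2 \rfloor} \right\rceil \lceil \log \lfloor K/2 \rfloor \rceil + \lceil \log K \rceil = \Theta\left(\frac{n}{K}\log K\right).$$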

We finally get that the total number of duplication-loss steps in a scenario produced by Algorithm 2 is at most $\Theta(\frac{n}{K}\log K + \frac{n^2}{K^2})$ asymptotically in the worst case.
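To get a feeling for the two contributions in this proof, the bounds can be evaluated for concrete values of $n$ and $K$. The Python sketch below is only an illustration: the function names are ours, logarithms are taken in base 2, $K \geq 2$ is assumed, and the formulas are the worst-case bounds for the first loop (moving each $L^i$ by $\lceil K/2 \rceil$ positions per step) and for the second loop (one call to Algorithm 1 per window).

```python
def ceil_log2(x):
    """ceil(log2(x)) for an integer x >= 1."""
    return (x - 1).bit_length()


def iterations(n, K):
    """Number of iterations of each loop: ceil((n - K) / floor(K/2)), at least 0."""
    return max(0, -(-(n - K) // (K // 2)))


def first_loop_bound(n, K):
    floor_half, ceil_half = K // 2, -(-K // 2)
    return sum(-(-(n - i * floor_half) // ceil_half)
               for i in range(1, iterations(n, K) + 1))


def second_loop_bound(n, K):
    return iterations(n, K) * ceil_log2(K // 2) + ceil_log2(K)
```

For $n = 100$ and $K = 10$, `first_loop_bound` gives 189 and `second_loop_bound` gives 58, so the first loop dominates for such a small window width.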



It is easily noticed that this worst case corresponds to the reversed identity permutation $n(n-1) \ldots 21$. This corresponds to our intuition of a worst case situation in this context. We can also notice that $\Theta(\frac{n}{K}\log K + \frac{n^2}{K^2}) = \Theta(\frac{n^2}{K^2})$ for "small" values of $K$, namely as long as $K = o(\frac{n}{\log n})$. If on the contrary $\frac{n}{\log n} = o(K)$, then $\Theta(\frac{n}{K}\log K + \frac{n^2}{K^2}) = \Theta(\frac{n}{K}\log K)$. When $K = \Theta(\frac{n}{\log n})$, the two terms are of the same order.
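The two regimes can be checked numerically by comparing the two terms of the bound. The following Python sketch is only illustrative (the function name `upper_bound_terms` is ours and logarithms are in base 2); it ignores all constant factors hidden in the $\Theta$ notation.

```python
from math import log2


def upper_bound_terms(n, K):
    """Return the pair (n/K * log K, n^2/K^2), without constant factors."""
    return (n / K) * log2(K), (n / K) ** 2


n = 2 ** 20
# Small window width: the n^2/K^2 term dominates.
sort_term, move_term = upper_bound_terms(n, 16)
print(sort_term < move_term)
# Window width of order n: the (n/K) log K term dominates.
sort_term, move_term = upper_bound_terms(n, n // 2)
print(sort_term > move_term)
```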

We can also compute the average number of duplication-loss steps of a scenario 
produced by Algorithm 2. 

Proposition 21 The number of duplication-loss steps of a scenario produced by Algorithm 2 on a permutation of size $n$ is on average $\Theta(\frac{n}{K}\log K + \frac{n^2}{K^2})$ asymptotically.



PROOF. First, we introduce some notation. Consider a permutation $\sigma$ of size $n$, and decompose it from right to left into $p = \lceil \frac{n-K}{\lfloor K/2 \rfloor} \rceil + 1$ windows of width $\lfloor K/2 \rfloor$, except the leftmost one, whose width is $n - \lceil \frac{n-K}{\lfloor K/2 \rfloor} \rceil \lfloor K/2 \rfloor \leq K$. We denote this decomposition by $\sigma = \sigma^1 \sigma^2 \ldots \sigma^p$.

Now, let us denote by $c(\sigma)$ the number of duplication-loss steps produced in the first loop of Algorithm 2 on $\sigma$. In particular, we denote by $c_p(\sigma)$ the number of such steps produced by the first iteration of this loop, i.e., the number of steps needed to move the elements of $L^1$ to the end of the permutation. To compute the average number of such steps, we consider $u_n = \sum_{\sigma \in S_n} c(\sigma)$. It is easy to see that

$$\begin{aligned}
u_n &= \sum_{\sigma \in S_n} \left( c_p(\sigma) + c(\sigma^1 \ldots \sigma^{p-1}) \right) \\
&= \sum_{\sigma \in S_n} c_p(\sigma) + n(n-1) \cdots (n - \lfloor K/2 \rfloor + 1) \sum_{\sigma \in S_{n - \lfloor K/2 \rfloor}} c(\sigma) \\
&= \sum_{\sigma \in S_n} c_p(\sigma) + \frac{n!}{(n - \lfloor K/2 \rfloor)!}\, u_{n - \lfloor K/2 \rfloor}.
\end{aligned}$$

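Let us focus on $\sum_{\sigma \in S_n} c_p(\sigma)$. Figure 8 should convince the reader that

$$\frac{n + 1 - \lfloor K/2 \rfloor - \min(\sigma^p)}{K} \leq c_p(\sigma) \leq \frac{n + 1 - \lfloor K/2 \rfloor - \min(\sigma^p)}{\lfloor K/2 \rfloor}.$$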



[Figure 8: diagram with the labels "positions: $1\ 2 \cdots \min(\sigma^p)$", "$n + 1 - \min(\sigma^p)$ elements" and "$\lfloor K/2 \rfloor$ rightmost positions".]
Fig. 8. Bounding $c_p(\sigma)$



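Now, we can notice that the number of permutations $\sigma$ of size $n$ such that $\min(\sigma^p) = i$ is $\binom{n-i}{\lfloor K/2 \rfloor - 1} (n - \lfloor K/2 \rfloor)!\, \lfloor K/2 \rfloor!$. This yields

$$\begin{aligned}
\sum_{\sigma \in S_n} \left( n + 1 - \lfloor K/2 \rfloor - \min(\sigma^p) \right)
&= \sum_{i=1}^{n - \lfloor K/2 \rfloor + 1} \left( n + 1 - \lfloor K/2 \rfloor - i \right) \binom{n-i}{\lfloor K/2 \rfloor - 1} (n - \lfloor K/2 \rfloor)!\, \lfloor K/2 \rfloor! \\
&= (n - \lfloor K/2 \rfloor)!\, \lfloor K/2 \rfloor! \sum_{i = \lfloor K/2 \rfloor - 1}^{n-1} \left( i + 1 - \lfloor K/2 \rfloor \right) \binom{i}{\lfloor K/2 \rfloor - 1} \\
&= (n - \lfloor K/2 \rfloor)!\, \lfloor K/2 \rfloor!\, \lfloor K/2 \rfloor \sum_{i = \lfloor K/2 \rfloor - 1}^{n-1} \binom{i}{\lfloor K/2 \rfloor} \\
&= (n - \lfloor K/2 \rfloor)!\, \lfloor K/2 \rfloor!\, \lfloor K/2 \rfloor \binom{n}{\lfloor K/2 \rfloor + 1}.
\end{aligned}$$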
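The closed form of this sum can be checked by brute force on small cases. In the Python sketch below, which is only a sanity check and not part of the proof, `h` stands for $\lfloor K/2 \rfloor$ and the function names are ours; the closed form is written as $n!\, h (n-h)/(h+1)$, which is the last line of the computation above after expanding the binomial coefficient.

```python
from itertools import permutations
from math import factorial


def total_by_enumeration(n, h):
    """Sum of n + 1 - h - min(last h entries) over all permutations of 1..n."""
    return sum(n + 1 - h - min(sigma[-h:])
               for sigma in permutations(range(1, n + 1)))


def total_closed_form(n, h):
    """(n-h)! h! h binom(n, h+1), simplified to n! h (n-h) / (h+1)."""
    return factorial(n) * h * (n - h) // (h + 1)


# Exhaustive comparison on small sizes (at most 6! = 720 permutations each).
for n in range(2, 7):
    for h in range(1, n):
        assert total_by_enumeration(n, h) == total_closed_form(n, h)
```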



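Consequently,

$$\sum_{\sigma \in S_n} c_p(\sigma) \leq (n - \lfloor K/2 \rfloor)!\, \lfloor K/2 \rfloor! \binom{n}{\lfloor K/2 \rfloor + 1}
\quad \text{and} \quad
\sum_{\sigma \in S_n} c_p(\sigma) \geq \frac{1}{3} (n - \lfloor K/2 \rfloor)!\, \lfloor K/2 \rfloor! \binom{n}{\lfloor K/2 \rfloor + 1},$$

giving after a few computations

$$\frac{1}{3} \cdot \frac{n - \lfloor K/2 \rfloor}{\lfloor K/2 \rfloor + 1} + \frac{u_{n - \lfloor K/2 \rfloor}}{(n - \lfloor K/2 \rfloor)!} \leq \frac{u_n}{n!} \leq \frac{n - \lfloor K/2 \rfloor}{\lfloor K/2 \rfloor + 1} + \frac{u_{n - \lfloor K/2 \rfloor}}{(n - \lfloor K/2 \rfloor)!}.$$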



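Therefore, we consider two sequences $(v_n)$ and $(w_n)$ satisfying the relations $v_n = \frac{1}{3} \cdot \frac{n - \lfloor K/2 \rfloor}{\lfloor K/2 \rfloor + 1} + v_{n - \lfloor K/2 \rfloor}$ and $w_n = \frac{n - \lfloor K/2 \rfloor}{\lfloor K/2 \rfloor + 1} + w_{n - \lfloor K/2 \rfloor}$ respectively if $n > K$, and $v_n = w_n = 0$ for any $n \leq K$. Then we have $v_n \leq \frac{u_n}{n!} \leq w_n$ for all $n \in \mathbb{N}$. We can solve the recurrence equations for $v_n$ and $w_n$; if we write $n = q \lfloor K/2 \rfloor + r$ (with $\lfloor K/2 \rfloor < r \leq K$), we get:

$$v_n = \frac{1}{3} \sum_{i=1}^{q} \frac{n - i \lfloor K/2 \rfloor}{\lfloor K/2 \rfloor + 1} + v_r
= \frac{1}{3(\lfloor K/2 \rfloor + 1)} \left( q n - \lfloor K/2 \rfloor \frac{q(q+1)}{2} \right) + v_r
= \Theta\left(\frac{n^2}{K^2}\right)$$

and

$$w_n = \sum_{i=1}^{q} \frac{n - i \lfloor K/2 \rfloor}{\lfloor K/2 \rfloor + 1} + w_r = \Theta\left(\frac{n^2}{K^2}\right).$$

Consequently, the average number of duplication-loss steps produced by the first loop of Algorithm 2 on permutations of size $n$ is $\frac{u_n}{n!} = \Theta(\frac{n^2}{K^2})$.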
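The recurrence for $(w_n)$ can be unrolled mechanically and compared with the value of the corresponding finite sum. The Python sketch below does this with exact rational arithmetic. It is only a check under our own conventions: the function names are ours, `half` stands for $\lfloor K/2 \rfloor$, and `q` denotes the number of times the recurrence is applied before reaching an index at most $K$, that is $\lceil (n-K)/\lfloor K/2 \rfloor \rceil$.

```python
from fractions import Fraction


def w_by_recurrence(n, K):
    """Unroll w_n = (n - half)/(half + 1) + w_{n - half} for n > K, w_n = 0 otherwise."""
    half = K // 2
    total = Fraction(0)
    while n > K:
        total += Fraction(n - half, half + 1)
        n -= half
    return total


def w_by_sum(n, K):
    """Sum over i = 1..q of (n - i*half)/(half + 1), evaluated in closed form."""
    half = K // 2
    q = max(0, -(-(n - K) // half))  # ceil((n - K) / half), 0 if n <= K
    return Fraction(2 * q * n - half * q * (q + 1), 2 * (half + 1))


# The two computations agree on a range of small parameters.
for K in range(2, 12):
    for n in range(1, 60):
        assert w_by_recurrence(n, K) == w_by_sum(n, K)
```

Multiplying these values by $\frac{1}{3}$ gives the corresponding values of $(v_n)$, since both sequences vanish for $n \leq K$.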

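What is left to compute is the average number of duplication-loss steps produced by the second loop of Algorithm 2 on permutations of size $n$. This number is given by

$$\begin{aligned}
\frac{1}{n!} \sum_{\sigma \in S_n} \sum_{i=1}^{p} \lceil \log(\mathrm{desc}(\sigma^i) + 1) \rceil
&= \frac{1}{n!} \left( \sum_{i=2}^{p} \sum_{\sigma \in S_n} \lceil \log(\mathrm{desc}(\sigma^i) + 1) \rceil + \sum_{\sigma \in S_n} \lceil \log(\mathrm{desc}(\sigma^1) + 1) \rceil \right) \\
&= \frac{1}{n!} \left( (p-1) (n - \lfloor K/2 \rfloor)! \binom{n}{\lfloor K/2 \rfloor} \sum_{\sigma \in S_{\lfloor K/2 \rfloor}} \lceil \log(\mathrm{desc}(\sigma) + 1) \rceil + (n - |\sigma^1|)! \binom{n}{|\sigma^1|} \sum_{\sigma \in S_{|\sigma^1|}} \lceil \log(\mathrm{desc}(\sigma) + 1) \rceil \right) \\
&= \frac{p-1}{\lfloor K/2 \rfloor!} \sum_{\sigma \in S_{\lfloor K/2 \rfloor}} \lceil \log(\mathrm{desc}(\sigma) + 1) \rceil + \frac{1}{|\sigma^1|!} \sum_{\sigma \in S_{|\sigma^1|}} \lceil \log(\mathrm{desc}(\sigma) + 1) \rceil.
\end{aligned}$$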

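Since $p = \lceil \frac{n-K}{\lfloor K/2 \rfloor} \rceil + 1$, we deduce that the average number of duplication-loss steps produced by the second loop of Algorithm 2 on permutations of size $n$ is $\Theta\left( \frac{1}{\lfloor K/2 \rfloor!} \left( \lceil \frac{n-K}{\lfloor K/2 \rfloor} \rceil + 1 \right) \sum_{\sigma \in S_{\lfloor K/2 \rfloor}} \lceil \log(\mathrm{desc}(\sigma) + 1) \rceil \right)$. Hence we focus on the computation of $\sum_{\sigma \in S_k} \lceil \log(\mathrm{desc}(\sigma) + 1) \rceil$ for $k = \lfloor K/2 \rfloor$. By concavity of the log function, $\log x \geq \frac{x-1}{k-1} \log k$ for any $x \in [1, k]$ (with $k \geq 2$); since $\frac{1}{k!} \sum_{\sigma \in S_k} (\mathrm{desc}(\sigma) + 1) = \frac{k+1}{2}$, we get that

$$\frac{1}{k!} \sum_{\sigma \in S_k} \lceil \log(\mathrm{desc}(\sigma) + 1) \rceil \geq \frac{1}{k!} \sum_{\sigma \in S_k} \log(\mathrm{desc}(\sigma) + 1) \geq \frac{1}{2} \log(k).$$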



Moreover, it is clear that

$$\frac{1}{k!} \sum_{\sigma \in S_k} \lceil \log(\mathrm{desc}(\sigma) + 1) \rceil \leq \lceil \log(k) \rceil,$$

so that we deduce that

$$\frac{1}{k!} \sum_{\sigma \in S_k} \lceil \log(\mathrm{desc}(\sigma) + 1) \rceil = \Theta(\log(k)). \qquad (1)$$
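For small values of $k$, the average in equation (1) can be computed exhaustively. The Python sketch below is only a numerical illustration: the function names are ours, logarithms are in base 2, and the enumeration is feasible only for small $k$ since it visits all $k!$ permutations.

```python
from itertools import permutations
from math import factorial


def desc(perm):
    """Number of descents of a permutation given as a sequence."""
    return sum(1 for a, b in zip(perm, perm[1:]) if a > b)


def average_steps(k):
    """Average over S_k of ceil(log2(desc + 1)), i.e. of desc.bit_length()."""
    total = sum(desc(p).bit_length() for p in permutations(range(1, k + 1)))
    return total / factorial(k)


def average_runs(k):
    """Average over S_k of desc + 1, the number of maximal increasing substrings."""
    total = sum(desc(p) + 1 for p in permutations(range(1, k + 1)))
    return total / factorial(k)


for k in range(1, 8):
    # The average number of steps never exceeds ceil(log2(k)).
    assert average_steps(k) <= (k - 1).bit_length()
    # The average number of maximal increasing substrings is (k + 1) / 2.
    assert average_runs(k) == (k + 1) / 2
```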

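Consequently, the average number of duplication-loss steps produced by the second loop of Algorithm 2 on permutations of size $n$ is $\Theta\left( \left( \lceil \frac{n-K}{\lfloor K/2 \rfloor} \rceil + 1 \right) \log(\lfloor K/2 \rfloor) \right) = \Theta(\frac{n}{K}\log K)$.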

Finally, we end the proof by concluding that the total number of duplication-loss steps in a scenario produced by Algorithm 2 on a permutation of size $n$ is $\Theta(\frac{n}{K}\log K + \frac{n^2}{K^2})$ on average.



3.2 Lower bound 

It is possible to provide very simple lower bounds on the number of duplication-loss steps necessary to obtain a permutation. These lower bounds are given and proved in Propositions 22 and 23 below. They are tight in most cases, though not in all of them. Indeed the upper and lower bounds coincide up to a constant factor whenever $K$ is a constant, or when $K = K(n)$, except when $\frac{n}{\log n} \ll K(n) \ll n$.

Proposition 22 In the worst case, $\Omega(\log n + \frac{n^2}{K^2})$ duplication-loss steps of width $K$ are necessary to obtain a permutation of $S_n$ from $123 \ldots n$.



PROOF. Let us first consider the number of inversions that a duplication-loss step $s$ of width $K$ can create in a permutation. It is easily seen that these new inversions can only involve two elements of $s$. Call $i$ the number of elements of $s$ that are kept in the first copy. Then the maximum number of inversions that can be created by $s$ is $i(K - i) \leq \frac{K^2}{4}$. Now, a permutation $\sigma \in S_n$ has up to $\frac{n(n-1)}{2}$ inversions, so that at least $\frac{2n(n-1)}{K^2}$ duplication-loss steps are necessary to transform $123 \ldots n$ into $\sigma$.
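The inversion argument can be illustrated by a small Python sketch. It is only a sketch under our own conventions: the function names are hypothetical, a duplication-loss step is encoded by a start index and a list of booleans telling which elements of the window are kept in the first copy, and the lower bound uses $\lfloor K^2/4 \rfloor$, which is valid since $i(K-i)$ is an integer.

```python
from itertools import combinations, product


def inversions(perm):
    """Number of pairs of entries that appear in decreasing order."""
    return sum(1 for a, b in combinations(perm, 2) if a > b)


def dup_loss_step(pi, start, keep_first):
    """Apply one duplication-loss step on the window pi[start:start + width]."""
    width = len(keep_first)
    window = pi[start:start + width]
    first = [v for v, keep in zip(window, keep_first) if keep]
    second = [v for v, keep in zip(window, keep_first) if not keep]
    return pi[:start] + first + second + pi[start + width:]


def max_created_inversions(K):
    """Largest number of inversions one step of width K creates on the identity."""
    identity = list(range(1, K + 1))
    return max(inversions(dup_loss_step(identity, 0, mask))
               for mask in product([True, False], repeat=K))


def inversion_lower_bound(sigma, K):
    """ceil(inv(sigma) / floor(K^2 / 4)) steps are needed to reach sigma."""
    return -(-inversions(sigma) // ((K * K) // 4))
```

For the reversed identity of size 10 and $K = 6$, there are 45 inversions and each step creates at most 9 of them, so `inversion_lower_bound` returns 5.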

To get the other term of the lower bound, we just refer to the result of (4) recalled at the beginning of this section, namely that $\log n$ steps are necessary in the worst case in the whole genome duplication - random loss model, in which duplication-loss operations are less restricted.

Finally, we get a lower bound of $\Omega(\log n + \frac{n^2}{K^2})$ necessary duplication-loss steps to obtain a permutation of $S_n$ from $123 \ldots n$ in the worst case.

Proposition 23 On average, $\Omega(\log n + \frac{n^2}{K^2})$ duplication-loss steps of width $K$ are necessary to obtain a permutation of $S_n$ from $123 \ldots n$.



PROOF. As before, a duplication-loss step can create at most $\frac{K^2}{4}$ inversions in a permutation. But the average number of inversions in a permutation of $S_n$ is $\frac{n(n-1)}{4}$, so that on average at least $\frac{n(n-1)}{K^2}$ duplication-loss steps are necessary to transform $123 \ldots n$ into $\sigma \in S_n$.

Again, (4) provides us with the $\Omega(\log n)$ lower bound, referring to the whole genome duplication - random loss model, which is more general than ours, so that this bound applies in our context.

We conclude that a lower bound on the average number of duplication-loss steps necessary to obtain a permutation of $S_n$ from $123 \ldots n$ is $\Omega(\log n + \frac{n^2}{K^2})$.



4 Conclusion 

We discuss the results of Section 3 on the average (or worst case) number of steps of width $K$ to obtain a permutation of size $n$. We could not provide lower bounds that coincide with the upper bounds given by our algorithm in all cases, but we claim that they are tight in many cases. Indeed, whenever $K = o(\frac{n}{\log n})$, we get that $\frac{n}{K}\log K = o(\frac{n^2}{K^2})$, and consequently the upper bound can be rewritten as $\Theta(\frac{n}{K}\log K + \frac{n^2}{K^2}) = \Theta(\frac{n^2}{K^2})$, which coincides up to a constant factor with the lower bound $\Omega(\log n + \frac{n^2}{K^2}) = \Omega(\frac{n^2}{K^2})$. For the case $K = \Theta(\frac{n}{\log n})$, the same argument holds, but the constant factor between the lower and the upper bound might be much greater. Finally, if $K = \Theta(n)$, then $\Theta(\frac{n}{K}\log K + \frac{n^2}{K^2}) = \Theta(\log n)$ and $\Omega(\log n + \frac{n^2}{K^2}) = \Omega(\log n)$, so that upper and lower bounds coincide again.

On the contrary, when $\frac{n}{\log n} \ll K \ll n$, the upper and lower bounds provided do not coincide. We leave as an open question the problem of finding an algorithm that computes a duplication-loss scenario whose number of steps is optimal (on average and in the worst case) up to a constant factor, when the width $K$ of the duplicated windows satisfies $\frac{n}{\log n} \ll K \ll n$.

Several other questions are still open. First of all, neither of our algorithms is optimal for a specific permutation, and our results are only optimal asymptotically, on average and/or in the worst case. It could be interesting to provide algorithms that produce optimal duplication-loss scenarios on any permutation $\sigma$, for $K = K(n)$, in order to provide some continuity between the bubble sort (corresponding to $K = 2$) and the radix sort (corresponding to $K(n) = n$).